This project was inspired by the multiple time-series analysis tools
I learned at UCLA which I wanted to refresh my memory on. The dataset I
selected compares diamond and gold price data with other variables
everyday between March 2018 and June 2021 thereby being 1134
observations across 6 variables. This means that the dataset is large
and very useful for the time-series analysis project. Furthermore, while
this project initially focuses on the diamond price
variable, I will still be modeling and comparing the other variables.
Specifically, I will firstly start by understanding the data and
variables through simple visualizations and tables as they are in the
dataset. Then, I will convert these into time-series using the
library(tseries) and analyze the ACF, PACF, and
stationarity of each time-series. Finally, I will run three different
models on each variable, these being an Autoregressive (AR),
Autoregressive Distributed Lag Model (ARDL), and Vector Autoregressions
(VARs). The goal of this analysis is to understand which of these models
best fits the data and produces the most accurate forecast of the
variables.
dia <- read_csv("diamond_data_merged_with_other_variables.csv")
This dataset was originally sourced from Kaggle and compiled by user Sibelius_5. It is constituted of 1134 observations across 6 variables with these variables being date, diamond price, inflation rate, interest rate, fed rate, and gold price. For the rates, it is important to note that this is within the United States, with the fed rate referring to the federal funds rate. In order to gain a basic understanding of these variables, I will plot these variables across time, compare their histograms, and examine their summary tables.
Since late 2018, the plot shows that there has been a continual
decline in diamond prices until mid 2020. From that point on, there has
been a constant increase until the end of timespan in the data set. The
smooth fit from the library(ggplot2) fits the data
seemingly very well.
The plot for gold price is seemingly inverse to the diamond price plot, with a gradual increase in price since late 2018 peaking in late 2020. The plot points seem somewhat more spread out than with diamond price and this is range of prices are way lower.
Inflation rates between 2018 and 2020 declined somewhat from slightly above 2.0% to around 1.5%. It is interesting to note the large dip on the week of the 16th March, 2020 which coincides with the start of mandated lockdowns in the United States during the COVID-19 pandemic. From that week on, there was a fairly steep increase until the end of data set.
Essentially throughout the span of the data, there has been a gradual decline eventually dipping into the negative at the beginning of 2020. Furthermore, the dip in inflation rates coincides with the sharp peak in interest rates during the same week. The trends in interest and inflation during this time reflect the supply chain issues during this period.
The sharp decline in the federal funds rate shows the Federal Reserve’s attempt to try stimulate the economy by increasing the money supply. It is interesting to note that since the federal funds rate is determined by the Federal Reserve, the points are clearly distributed in a somewhat different way than in the other graphs.
The histogram for diamond prices reveals that there are four different peaks. The biggest peaks are around 9800 USD and 10300 USD and it broadly ranges between 9250 and 11000.
Gold prices have a sharp peak around 1300 USD. It is interesting that there seem to be peaks around the uneven 200 USD intervals (ie 1300, 1500, etc.). These prices range between 1100 USD and 2100 USD.
The most common frequencies of inflation rates lie between 1.5% and 2.2%. Indeed, the biggest peak is around 2.2% which makes sense given that the Fed usually sets 2% as the target for their inflation rates.
There are three clear peaks in interest rates around -1.0, 0.15, 0.75. It’s interesting to contrast this with the inflation as that distribution is evidently more skewed to one side whereas interest fluctuates a lot more. This is likely explained by the fact that the Fed does not directly target a specific interest rate. While not relevant for our time-series model, these trends that reflect the fiscal policies of the Fed are a very interesting curiosity.
The stratification of the federal funds rate, which was already visible in the scatterplot, is made much more clear by the histogram. The frequency of the very very low rate around 0.05 shows the monetary policy of the Fed trying to increase the money supply.
summary(dia$`diamond price`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9272 9766 10005 10016 10305 10805
The median diamond price between 2018 and 2021 was 10005 USD with the mean price being very close at 10016 USD.
summary(dia$`gold price`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1170 1293 1502 1538 1776 2064
For gold prices, the median was 1502 USD and, again, the mean was only slightly higher at 1538 USD.
summary(dia$`inflation rate`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 1.630 1.780 1.806 2.100 2.540
Inflation rates had a median percentage of 1.78% and mean of 1.81%. Given that the target Fed inflation was around 2%, it seems to have slightly missed it.
summary(dia$`interest rate`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.08000 -0.75750 0.12000 0.01936 0.75000 1.17000
Median interest rates between 2018 and 2021 were 0.12% while the mean was 0.02%. It fluctuates quite evenly between the maximum and minimum values.
summary(dia$`fed rate`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04 0.09 1.58 1.26 2.19 2.45
The median federal funds rate was 1.58% with the mean coming in even lower at 1.26%. This makes sense given the frequency of the very low fed rate at 0.05% bringing the mean down.
ts_data <- ts(dia)
# Keeping this here to demonstrate how it was coded.
# The subsequent code will not be shown for this section.
diapr.ts <- ts(dia$`diamond price`, start=c(1), end=c(1134), frequency=6)
tsdisplay(diapr.ts)
adf.test(diapr.ts)
##
## Augmented Dickey-Fuller Test
##
## data: diapr.ts
## Dickey-Fuller = -2.6668, Lag order = 18, p-value = 0.2959
## alternative hypothesis: stationary
The stationarity plot tells us that diamond price returns to its mean and has a constant variance. However, the ACF plot is very concerning and shows that the data must be transformed as currently there are serially correlated. errors. Subsequent plots will be done for the sake of completion but they all illustrate the same problem with their ACF plots.
The Augmented Dickey-Fuller (ADF) test reflects this too with a high p-value rejecting the null hypothesis that the data is stationary.
##
## Augmented Dickey-Fuller Test
##
## data: goldpr.ts
## Dickey-Fuller = -3.2043, Lag order = 18, p-value = 0.08708
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: infl.ts
## Dickey-Fuller = -4.4292, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: intr.ts
## Dickey-Fuller = -3.1836, Lag order = 18, p-value = 0.09065
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: fedr.ts
## Dickey-Fuller = -2.9221, Lag order = 18, p-value = 0.1878
## alternative hypothesis: stationary
Given the problems with the ACF, it is important to transform the data in order to correct for this. I will do this through differencing which is often useful for removing any trends or seasonality within the data, as well as then taking logs of these differences
# First-order differencing for daily data
differenced_data <- as.data.frame(diff(ts_data))
differenced_data$diamond <- rev(differenced_data$`diamond price`)
differenced_data$gold <- rev(differenced_data$`gold price`)
differenced_data$interest <- rev(differenced_data$`interest rate`)
differenced_data$inflation <- rev(differenced_data$`inflation rate`)
differenced_data$fed <- rev(differenced_data$`fed rate`)
# Keeping this here to demonstrate how it was coded.
# The subsequent code will not be shown for this section.
diampr.ts <- ts(differenced_data$diamond, start=c(1), end=c(1134), frequency=6)
tsdisplay(diampr.ts)
adf.test(diampr.ts)
## Warning in adf.test(diampr.ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diampr.ts
## Dickey-Fuller = -15.128, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
The stationarity plot tells us that diamond price returns to its mean and has a constant variance. The ACF plot seems to have no patterns with significant lags around 0, 5, 18, 23, and 28. The PACF is similar in distribution, with the negative lags being more significant.
The ADF test gives a p-value less than 0.05 thus we accept the null hypothesis and conclude that the data is stationary.
##
## Augmented Dickey-Fuller Test
##
## data: goldpr.ts
## Dickey-Fuller = -20.36, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
The data for gold price is stationary, with constant variation away from the mean. According to the PACF, the first two lags are significant with others roughly around 6, 10, 16, 20, and 35. The ADF test reinforces this by resulting in a small p-value thus this is stationary.
##
## Augmented Dickey-Fuller Test
##
## data: infl.ts
## Dickey-Fuller = -18.397, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
Inflation rate is stationary. According the the PACF, there is no pattern to which lags are significant, with most under 6 being relevant as well as many around 25 and 35. The ADF test also has a small p-value so we conclude inflation is stationary.
##
## Augmented Dickey-Fuller Test
##
## data: intr.ts
## Dickey-Fuller = -23.461, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
Interest rate is also stationary as shown by the ADF p-value. From the PACF, the first few lags are important, with some later lags also having higher importance than the rest.
##
## Augmented Dickey-Fuller Test
##
## data: fedr.ts
## Dickey-Fuller = -14.758, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
The ADF test shows that the data is stationary. The ACF and PACF show that there are few significant lags with the only ones being around 13, 15, and 24.
Therefore, the differencing method was successful in transforming the data but we must keep in mind that the subsequent analysis will be on the differences of our variables instead of the direct variable.
In this section, we will look at how our variables fit into autoregressive (AR) models and evaluating the performance of each by analyzing the residual plots of the time-series, then forecasting our variable 10 days into the future.
Autoregressive models are powerful when used with stationary data and for simple short-term forecasting where external variables do not play a significant role. They take the general form:
\(X_t=ϕ_1 X_{t-1} + ϕ_2 X_{t-2} + ...+ ϕ_p X_{t-p} +ε_t\)
This means that while we expect these models to be useful and their forecasts somewhat accurate, it is likely that most of these variables will perform better with ARDL models. This is because we know that external factors do play a role, especially in how inflation and interests affect prices. Nevertheless, it is important to first run these models so we can compare them and test this hypothesis.
dia.ts <- ts(differenced_data, start = c(1), end=c(1134), frequency = 6)
# Setting y variable - diamond price
ydp <- differenced_data[,"diamond"]
dia.ar = ar(ydp, aic=TRUE, order.max = NULL, method = "ols")
dia.ar$aic
## 0 1 2 3 4 5 6 7
## 11.100190 0.000000 2.879114 5.465032 7.256581 7.117391 9.915862 12.507309
## 8 9 10 11 12 13 14 15
## 14.278973 16.784077 18.616339 21.161695 23.942707 26.454687 29.325116 30.922889
## 16 17 18 19 20 21 22 23
## 33.645523 32.255396 35.169558 35.964046 38.058539 36.967126 37.895147 28.657095
## 24 25 26 27 28 29 30
## 30.096548 31.691337 34.193052 36.590813 35.495486 36.508323 39.233292
dia.ar = ar(ydp, aic=TRUE, order.max = 1, method = "ols")
tsdisplay(dia.ar$resid)
Firstly, we see that the AIC of the different lags is lowest at 1.
Therefore, we set the order.max = 1.
The ACF and PACF of these residuals are mostly positive, and the most positively correlated lags are just after lag 20. However, according to the PACF the only significant lag is 23 while the ACF also has 28 as significant as well.
# Creating test and train data sets
set.seed(12)
Row.number <- sample(1:nrow(differenced_data), 0.67*nrow(differenced_data))
Train = differenced_data[Row.number,]
Test = differenced_data[-Row.number,]
# Dimensions of test and train
dim(Train)
## [1] 759 11
dim(Test)
## [1] 374 11
# Regressions and testing with MSE
dia.train = lm(Train$diamond~ Train$date, data=Train)
dia.test = lm(Test$diamond~Test$date, data=Test)
plot(residuals(dia.train), main = "Residuals of Train Data")
mean((Train$diamond - predict(dia.train, Train)) ^2)
## [1] 79.30967
mean((Test$diamond - predict(dia.test, Test)) ^2)
## [1] 657.6424
The test-train cross-validation we used divides our data set into a train (67%) and test (33%) sets. By fitting the autoregressive model of diamond price onto the train and test data sets, and then comparing the MSE’s of each fit, we can determine whether the model is good in a general setting or only works within our data. With a train MSE of 79.3 and test MSE of 657.6 which suggests that for diamond price the model is severely overfit and not good in a general setting.
dia.fr <- forecast(dia.ar, 10)
plot(dia.fr, main = "Diamond Price Differences")
Here, we are forecasting the diamond price differences 10 days into the future and it seems to follow on seamlessly from the pre-existing time-series plot, which is very encouraging. However, it might also be useful to look at the diamond price itself, rather than the differences. This will require some transformation of the data which is shown below.
differences <- c(dia.fr$mean)
# Loop creating the series of forecasted diamond prices based on the differences.
series <- vector(length = 5)
series[1] <- 10385.16
for (i in 2:10) {
series[i] <- series[i-1] + differences[i-1]
}
# Transforming this into a dataset and combining with the dates.
forecasteddiamonds <- c(dia$`diamond price`, series)
forecasteddiamonds <- as.data.frame(forecasteddiamonds)
forecasteddiamonds$dates <- c(as_date(c("2021-06-14","2021-06-13","2021-06-12","2021-06-11","2021-06-10","2021-06-09","2021-06-08","2021-06-07","2021-06-06","2021-06-05")), dia$`date`)
forecasteddiamonds$forecast <- forecasteddiamonds$dates > "2021-06-05"
# Plotting between April and June 2021.
frplot.dia <- ggplot(data = forecasteddiamonds, aes(x = dates, y = forecasteddiamonds, group = forecast)) +
geom_point() +
geom_smooth(aes(color = forecast), method = "loess") +
xlim(as.Date("2021-04-01"), as.Date("2021-06-14")) + ylim(10000, 10500) +
ggtitle("Diamond Price Forecast (April 2021 - June 2021)") +
xlab("Date") + ylab("Price (USD)")
frplot.dia
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1069 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1069 rows containing missing values (`geom_point()`).
Therefore, the forecast shows that diamond prices will continue to increase at more or less the same rate as it already has been throughout June. This forecast is only looking 10 days into the future, so around a week, thus this makes sense as we would not expect large changes within a week unless there is some anomalous occurrence.
## 0 1 2 3 4 5 6 7
## 17.087106 12.278404 0.000000 2.059499 2.514869 5.177340 2.761349 4.333829
## 8 9 10 11 12 13 14 15
## 3.867678 4.573359 6.809589 5.776145 7.417946 10.022494 12.951439 15.221819
## 16 17 18 19 20 21 22 23
## 14.430339 15.729118 17.301671 18.794466 12.776415 12.806140 11.940438 9.634692
## 24 25 26 27 28 29 30
## 12.186047 15.023708 17.121652 19.343841 22.027558 24.446079 27.077527
The AIC is lowest at an order of 2 therefore, that will be our max
order.
The ACF and PACF of these residuals are mostly negative, and the most negatively correlated lags are around 5 to 10. However, according to the PACF the only significant lags are 6 and 22 while the same is true of the ACF.
## [1] 281.4056
## [1] 286.2197
For gold price, we found that our train MSE is 281.4 and our test MSE is 286.2. These are significantly closer than in our diamond price model and indicate that our model works well and can generalize to new data. Therefore, within the data our diamond price model was better but outside of the data (ie in general) the gold price model performs better.
Just as we did with the diamond prices, we are forecasting gold price difference 10 days into the future. While our forecasting has significantly less variation, it continues on from the previous plot which is a good sign.
Interestingly, our forecast predicts that the price will peak at around 1900 USD and, given the margins of error, hover around this price or even decline slightly. This reflects the higher volatility of gold prices compared to diamond prices as we can see from the previous months so our forecasts seem to make sense.
## 0 1 2 3 4 5 6
## 24.9875010 15.4624374 16.3128260 18.7452537 19.8560330 21.1350677 0.9513299
## 7 8 9 10 11 12 13
## 3.4135918 0.0000000 1.8714794 2.2204661 4.6529638 4.9712176 2.9333980
## 14 15 16 17 18 19 20
## 3.8351698 5.7841908 8.7739801 9.4808410 12.1659154 13.4444553 7.0811511
## 21 22 23 24 25 26 27
## 9.9154787 9.0366341 11.2172110 14.0614449 13.1014746 3.5024860 2.5589720
## 28 29 30
## 2.2341897 4.9681541 6.6504045
The AIC is lowest at an order of 28 therefore, that will be our max
order.
Our ACF and PACF residual plots show no significant lags, indicating that our model fits our data well and will likely perform well in the train-test cross-validation as our residuals have no structure.
## [1] 0.0007683683
## [1] 0.0005920851
Our results give us the lowest MSEs thus far for both train and test, however it is important to note that these are relative to the range of values of the inflation range. The train MSE is 0.00077 while the test MSE is 0.00056. This again reinforces that our model is doing well to generalize to new data and is not overfitting.
Again, the plot of the forecasted differences fits our data without the
noise and variation, which is to be expected from a forecast.
The plot for how inflation rate is predicted to change shows that our
forecast remains within the range of the previous week, with a
prediction that it will hover around 2.4%. Previously, we found that the
Fed targets an inflation rate of around 2% and our forecast follows on
fairly seamlessly which is a good indication that our inflation rate
autoregressive model works well.
## 0 1 2 3 4 5 6 7
## 31.641574 14.825134 15.160085 0.000000 2.869723 4.739321 7.056664 9.578116
## 8 9 10 11 12 13 14 15
## 12.541864 12.498249 10.337554 5.687466 5.206296 5.740264 3.255439 5.996196
## 16 17 18 19 20 21 22 23
## 5.513696 5.145943 6.139311 8.273340 8.872885 11.812350 14.725415 16.767654
## 24 25 26 27 28 29 30
## 19.227910 17.571048 19.442085 21.625517 15.556952 18.285236 16.988902
For our inflation rate autoregression, the lowest AIC is at an order of
3 thus this will be our max order.
The ACF and PACF seem to trend more so towards the negative however, only two lags are considered significant by these plots: lags 11 and 14. Generally, there doesnt seem to be any structure to our residuals which is good.
## [1] 0.001353517
## [1] 0.001030378
The train MSE is 0.0012 while the test MSE is 0.0013 which shows that our model is overfitting our data as it performs better in training than testing.
Again, visually the forecast seems to fit our data well and continues the general trend without any notable anomalies.
When we extrapolated the differences to our actual rate, the forecast seems to indicate that interests will more or less level out during the 10 days its observing. However, its important to note that since the pandemic, the interest rate fluctuated significantly throughout the week, and the forecast seems to capture this fluctuation despite some sizable margins of error.
## 0 1 2 3 4 5 6
## 76.678321 79.404977 81.821832 84.709256 87.067348 90.065500 93.004436
## 7 8 9 10 11 12 13
## 95.845072 98.378354 101.328693 104.280245 106.817091 0.000000 2.981370
## 14 15 16 17 18 19 20
## 5.030600 7.314068 10.325574 13.314553 16.268531 19.065963 22.066403
## 21 22 23 24 25 26 27
## 25.064034 28.075267 30.849452 23.216868 26.237140 28.286837 30.536356
## 28 29 30
## 32.854383 35.682019 38.703213
The AIC is lowest at an order of 12 therefore, that will be our max order.
The time-series plot of residuals, as well as the ACF and PACF plots, reveal that our the federal funds rate is likely not a suitable variable for an autoregressive model; the way data was originally distributed does not lend itself well to this type of analysis primarily because this variable does not fluctuate with time, but rather at the whim of the Federal Reserve. For this reason, we will unfortunately be discarding this variable moving forward.
The autoregressive model, as the name suggests, regresses a variable’s previous lags against itself to fit time-series data. While it is powerful for some variables, it may be less so for others and this is evident from our analysis above. The key insights we used are from the cross-validation and MSEs, in which the difference of diamond prices performed worst and overfit the data significantly, while inflation rate seemed to perform best in new data. Having accounted for stationarity, these results likely mean there are some other effects influencing the variables which are not captured within in the model.
Here, we are testing the AIC and BIC of our test data models to see which has the lowest and thus performs the best:
AIC(dia.test, gold.test, infr.test, inrt.test)
BIC(dia.test, gold.test, infr.test, inrt.test)
In both tests, it seems that the interest rate autoregressive model
performed best. This makes sense as the interest rate is the measure
that is least influenced by outside factors as the Fed only targets
inflation and the funds rate, therefore movements in interest are
somewhat more autonomous. However, it is important to keep in mind that
while AIC and BIC are useful measures, they have a strong preference
towards less complex models thus, even if our model is very accurate,
too much complexity (ie high order.max) will be penalized
more.
An Autoregressive Distributed Lag Model (ARDL) is similar to an autoregressive model but it instead compares not only the target variable to its own lags but to the lags of other variables as well. It takes the general form:
\(Y_t= α +β_1 Y_{t-1} + β_2 Y_{t-2} + ...+ β_p Y_{t-p}+ γ_1 X_{1,t-1} + γ_2 X_{1,t-2} + ...+ γ_p X_{1,t-p} +ε_t\)
Therefore, for this type of model, I’ll be using them on diamond prices and gold prices with inflation and interest to see which model fits best. I imagine that this model should be more powerful as prices are typically heavily influenced to inflation and interest.
# Again, only showing the code for the first variable
library(dynlm)
Du_ts = diff(diampr.ts)
Fu_ts = diff(infl.ts)
Hu_ts = diff(intr.ts)
# Diamond vs only inflation
diapr.ardl1 = dynlm(Du_ts ~ L(Du_ts,1) + L(Fu_ts,0:2))
summary(diapr.ardl1)
##
## Time series regression with "ts" data:
## Start = 1(4), End = 1134(1)
##
## Call:
## dynlm(formula = Du_ts ~ L(Du_ts, 1) + L(Fu_ts, 0:2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -464.84 -3.69 0.13 3.64 247.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001767 0.237900 0.007 0.9941
## L(Du_ts, 1) -0.446461 0.010858 -41.119 <2e-16 ***
## L(Fu_ts, 0:2)0 2.217276 7.875421 0.282 0.7783
## L(Fu_ts, 0:2)1 -22.093750 8.929340 -2.474 0.0134 *
## L(Fu_ts, 0:2)2 3.337455 7.879785 0.424 0.6719
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.61 on 6791 degrees of freedom
## Multiple R-squared: 0.2014, Adjusted R-squared: 0.201
## F-statistic: 428.3 on 4 and 6791 DF, p-value: < 2.2e-16
tsdisplay(diapr.ardl1$resid)
# Diamond vs only interest
diapr.ardl2 = dynlm(Du_ts ~ L(Du_ts,1) + L(Hu_ts,0:1))
summary(diapr.ardl2)
##
## Time series regression with "ts" data:
## Start = 1(3), End = 1133(7)
##
## Call:
## dynlm(formula = Du_ts ~ L(Du_ts, 1) + L(Hu_ts, 0:1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -463.56 -3.91 0.07 3.62 246.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.431e-04 2.378e-01 0.003 0.9975
## L(Du_ts, 1) -4.459e-01 1.085e-02 -41.106 <2e-16 ***
## L(Hu_ts, 0:1)0 1.012e+01 5.595e+00 1.809 0.0705 .
## L(Hu_ts, 0:1)1 -1.388e+01 5.599e+00 -2.479 0.0132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.61 on 6793 degrees of freedom
## Multiple R-squared: 0.2016, Adjusted R-squared: 0.2013
## F-statistic: 571.9 on 3 and 6793 DF, p-value: < 2.2e-16
tsdisplay(diapr.ardl2$resid)
# Diamond vs interest and inflation
diapr.ardl3 = dynlm(Du_ts ~ L(Du_ts,1) + L(Fu_ts,0:1) + L(Hu_ts,0:1))
summary(diapr.ardl3)
##
## Time series regression with "ts" data:
## Start = 1(3), End = 1133(7)
##
## Call:
## dynlm(formula = Du_ts ~ L(Du_ts, 1) + L(Fu_ts, 0:1) + L(Hu_ts,
## 0:1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.75 -3.85 0.16 3.71 246.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.789e-04 2.375e-01 0.004 0.997047
## L(Du_ts, 1) -4.451e-01 1.084e-02 -41.052 < 2e-16 ***
## L(Fu_ts, 0:1)0 3.491e+00 7.900e+00 0.442 0.658605
## L(Fu_ts, 0:1)1 -3.090e+01 7.897e+00 -3.913 9.21e-05 ***
## L(Hu_ts, 0:1)0 1.037e+01 5.834e+00 1.778 0.075447 .
## L(Hu_ts, 0:1)1 -2.001e+01 5.843e+00 -3.425 0.000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.58 on 6791 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.2037
## F-statistic: 348.7 on 5 and 6791 DF, p-value: < 2.2e-16
tsdisplay(diapr.ardl3$resid)
# Compare
AIC(diapr.ardl1,diapr.ardl2,diapr.ardl3)
## Warning in AIC.default(diapr.ardl1, diapr.ardl2, diapr.ardl3): models are not
## all fitted to the same number of observations
BIC(diapr.ardl1,diapr.ardl2,diapr.ardl3)
## Warning in BIC.default(diapr.ardl1, diapr.ardl2, diapr.ardl3): models are not
## all fitted to the same number of observations
Here we are comparing three different ways of modeling the ARDL function of diamond prices by comparing against the interest rate, inflation rate, and both of these together. It is important to note that for this we are using the difference of the variables. As per the residual plots, we can see that the PACF shows the first few lags are significantly negatively correlated and that our model has some structure which is a bit concerning. Ultimately, given the AIC and BIC of each model, it seems that diamond price is best modeled with both inflation and interest rates. This could make sense given that there is likely some multicollinearity occurring between the inflation and interest rates as these two economic factors are related.
set.seed(1)
Row.number <- sample(1:nrow(differenced_data), 0.67*nrow(differenced_data))
Trains = differenced_data[Row.number,]
Tests = differenced_data[-Row.number,]
dim(Trains)
## [1] 759 11
dim(Tests)
## [1] 374 11
infl_tts = ts(Trains$`inflation rate`,start=c(1), end=c(1134), frequency=6)
intr_tts = ts(Trains$`interest rate`,start=c(1), end=c(1134), frequency=6)
diapr_tts = ts(Trains$`diamond price`,start=c(1), end=c(1134), frequency=6)
du_tts = diff(diapr_tts)
fu_tts = diff(infl_tts)
hu_tts = diff(intr_tts)
diadyn.tr = dynlm(du_tts ~ L(du_tts,1) + L(fu_tts,1)+L(hu_tts,1), data = Trains)
infl_tst = ts(Tests$`inflation rate`,start=c(1), end=c(1134), frequency=6)
intr_tst = ts(Tests$`interest rate`,start=c(1), end=c(1134), frequency=6)
diapr_tst = ts(Tests$`diamond price`,start=c(1), end=c(1134), frequency=6)
du_tst = diff(diapr_tst)
fu_tst = diff(infl_tst)
hu_tst = diff(intr_tst)
diadyn.te = dynlm(du_tst ~ L(du_tst,1) + L(fu_tst,1)+L(hu_tst), data = Tests)
mean((Trains$`diamond price` - predict(diadyn.tr, Trains)) ^2)
## [1] 38.41845
mean((Tests$`diamond price` - predict(diadyn.te, Tests)) ^2)
## [1] 347.2272
The test-train cross-validation for our ARDL model produced a train MSE of 38.4 and a test MSE of 347.0. We used inflation rate as the independent variable because this was the best model given our AIC and BIC. Evidently, there is a pretty substantial difference between these results and show that our model is overfitting the data and does not perform well in a general setting. However, this was also the case for the previous AR model so this doesn’t serve to elucidate anything about which model to use.
# Forecasting the ardl model
library(dLagM)
## Loading required package: nardl
##
## Attaching package: 'dLagM'
## The following object is masked from 'package:forecast':
##
## forecast
## The following object is masked from 'package:generics':
##
## forecast
data = as.data.frame(differenced_data[1:1133,])
du_dts <- diff((ts(data$diamond,start=c(1), end=c(1134), frequency=1)))
fu_dts <- diff((ts(data$inflation,start=c(1), end=c(1134), frequency=1)))
hu_dts <- diff((ts(data$interest,start=c(1), end=c(1134), frequency=1)))
# Fitting the model for forecasting
diapr.ardl2Dlm = ardlDlm(formula = du_dts ~ fu_dts + hu_dts, data, p = 1, q=2)
x.new = matrix(c(1:10), ncol = 10, nrow = 2)
diapr.ardl2f <- dLagM::forecast(diapr.ardl2Dlm, x = x.new, h = 10,
interval = TRUE, level = 0.90, nSim = 100)
# Extracting the forecasts and taking the logarithm
diafdifferences <- log(abs(c(diapr.ardl2f$forecast$Forecast)))
series <- vector(length = 10)
series[1] <- 10385.16
for (i in 2:10) {
series[i] <- series[i-1] + diafdifferences[i]
}
# Transforming this into a dataset and combining with the dates.
forecasteddiaf <- c(rev(series),dia$`diamond price`)
forecasteddiaf <- as.data.frame(forecasteddiaf)
forecasteddiaf$dates <- c(as.Date(c("2021-06-14","2021-06-13","2021-06-12","2021-06-11","2021-06-10","2021-06-09","2021-06-08","2021-06-07","2021-06-06","2021-06-05")), dia$`date`)
forecasteddiaf$forecast <- forecasteddiaf$dates > "2021-06-05"
# Plotting between April and June 2021.
frplot.diaf <- ggplot(data = forecasteddiaf, aes(x = dates, y = forecasteddiaf, group = forecast)) +
geom_point() +
geom_smooth(aes(color = forecast), method = "loess") +
xlim(as.Date("2021-04-01"), as.Date("2021-06-14")) + ylim(10000, 10500) +
ggtitle("Diamond Price ARDL Forecast (April 2021 - June 2021)") +
xlab("Date") + ylab("Price USD")
frplot.diaf
## `geom_smooth()` using formula = 'y ~ x'
Our forecast for the ARDL model for diamond prices based on inflation rates shows a somewhat steeper than for the AR model with the data points varying significantly less. Given the residuals, the cross-validation, and the shape of this forecast, I am inclined to believe that the basic autoregressive model is better for diamond prices. This is unexpected because I would expect there to be some relationship between inflation and prices as these are economically related.
##
## Time series regression with "ts" data:
## Start = 1(3), End = 1133(7)
##
## Call:
## dynlm(formula = Gu_ts ~ L(Gu_ts, 1) + L(Fu_ts, 0:1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.855 -8.966 0.431 9.029 208.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002468 0.262075 -0.009 0.9925
## L(Gu_ts, 1) -0.489841 0.010596 -46.227 <2e-16 ***
## L(Fu_ts, 0:1)0 19.887583 8.350419 2.382 0.0173 *
## L(Fu_ts, 0:1)1 -12.422795 8.341498 -1.489 0.1365
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.61 on 6793 degrees of freedom
## Multiple R-squared: 0.2404, Adjusted R-squared: 0.2401
## F-statistic: 716.7 on 3 and 6793 DF, p-value: < 2.2e-16
##
## Time series regression with "ts" data:
## Start = 1(4), End = 1134(1)
##
## Call:
## dynlm(formula = Gu_ts ~ L(Gu_ts, 1) + L(Hu_ts, 0:2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.085 -8.837 0.231 8.610 205.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003344 0.257025 -0.013 0.990
## L(Gu_ts, 1) -0.488737 0.010586 -46.168 < 2e-16 ***
## L(Hu_ts, 0:2)0 -98.028416 6.227255 -15.742 < 2e-16 ***
## L(Hu_ts, 0:2)1 -29.768737 6.894826 -4.318 1.6e-05 ***
## L(Hu_ts, 0:2)2 14.272281 6.231757 2.290 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.19 on 6791 degrees of freedom
## Multiple R-squared: 0.2697, Adjusted R-squared: 0.2693
## F-statistic: 627 on 4 and 6791 DF, p-value: < 2.2e-16
##
## Time series regression with "ts" data:
## Start = 1(3), End = 1133(7)
##
## Call:
## dynlm(formula = Gu_ts ~ L(Gu_ts, 1) + L(Fu_ts, 1) + L(Hu_ts,
## 0:1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.759 -8.773 0.170 8.499 206.015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.125e-03 2.569e-01 -0.008 0.99340
## L(Gu_ts, 1) -4.894e-01 1.059e-02 -46.234 < 2e-16 ***
## L(Fu_ts, 1) -2.409e+01 7.456e+00 -3.231 0.00124 **
## L(Hu_ts, 0:1)0 -1.018e+02 6.044e+00 -16.846 < 2e-16 ***
## L(Hu_ts, 0:1)1 -4.189e+01 6.340e+00 -6.607 4.23e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 6792 degrees of freedom
## Multiple R-squared: 0.2703, Adjusted R-squared: 0.2698
## F-statistic: 628.9 on 4 and 6792 DF, p-value: < 2.2e-16
In this case, when looking at the ARDL models for gold price, our best model in both R-squared and in the measures of AIC and BIC is when using only interest as the independent variable. The PACF does have some structure to the lags, with almost all of the ones below 20 being negative and significant. This is not great but should work well enough moving forward.
## [1] 164.6818
## [1] 107.0751
The gold price model cross-validation reveals that our model performs fairly well and, similar to its AR model, performs well in a general setting. The train MSE is 165.7 while the test MSE is 107.1.
## `geom_smooth()` using formula = 'y ~ x'
The ARDL model forecast in gold prices is interesting; it continues well
from the existing data however, it predicts a much steeper decline than
in the autoregressive model. Nevertheless, the forecast points to the
price dipping slightly in the next ten days with the peak being just
over 1900 USD. The lower MSE of this model as well as the residual plot
and the fit of the forecast seems to indicate that the ARDL model is
performing better than the AR model for gold prices.
Vector Autoregressive (VAR) models take the principles of ARDL model (ie regressing on multiple variables’ lags) and expands on them by having two or more dependent/endogenous variables and using the lags of both in a system of simultaneous equations to model the variables. It has the general form:
\(Y_t=A_1 Y_{t-1} + A_2 Y_{t-2} + ...+ A_p Y_{t-p} +ε_t\)
While this may sound confusing, it is essentially a more robust method which treats every variable as endogenous and is used to look at the dynamic interactions between variables.
For this section, I will change up the structure of the code and instead develop both the gold and diamond models concurrently. Therefore, it will be easier to compare them. So, firstly we will be fitting the models
# Creating a flipped dataset
dia_data <- as.data.frame(ts(dia))
dia_data$diamond <- rev(dia_data$`diamond price`)
dia_data$gold <- rev(dia_data$`gold price`)
dia_data$interest <- rev(dia_data$`interest rate`)
dia_data$inflation <- rev(dia_data$`inflation rate`)
# As before, I will show how the code works for diamonds and not show it for gold for the sake of brevity
# First, we define our endogenous variables
dia.vara=cbind(dia_data$inflation, dia_data$diamond)
dia.a_tot= data.frame(dia.vara)
# Then, select the appropriate lags
VARselect(dia.a_tot, lag.max = 10)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 2 2 1 2
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) -1.6584950 -1.672538 -1.6686197 -1.6625663 -1.6571847 -1.6529197
## HQ(n) -1.6483588 -1.655644 -1.6449685 -1.6321576 -1.6200186 -1.6089961
## SC(n) -1.6316731 -1.627834 -1.6060351 -1.5821004 -1.5588375 -1.5366911
## FPE(n) 0.1904254 0.187770 0.1885072 0.1896518 0.1906753 0.1914904
## 7 8 9 10
## AIC(n) -1.6670199 -1.6607530 -1.6604239 -1.6543049
## HQ(n) -1.6163388 -1.6033145 -1.5962279 -1.5833514
## SC(n) -1.5329100 -1.5087619 -1.4905515 -1.4665511
## FPE(n) 0.1888095 0.1899967 0.1900596 0.1912266
# Fit the model with correct lags
dia.vara_model=VAR(dia.a_tot,p=2)
summary(dia.vara_model)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: X1, X2
## Deterministic variables: const
## Sample size: 1132
## Log Likelihood: -2248.696
## Roots of the characteristic polynomial:
## 0.9959 0.9959 0.1012 0.1012
## Call:
## VAR(y = dia.a_tot, p = 2)
##
##
## Estimation results for equation X1:
## ===================================
## X1 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 1.102e+00 2.962e-02 37.193 < 2e-16 ***
## X2.l1 7.168e-05 4.868e-05 1.473 0.141140
## X1.l2 -1.031e-01 2.979e-02 -3.460 0.000561 ***
## X2.l2 -7.357e-05 4.840e-05 -1.520 0.128832
## const 2.165e-02 2.304e-02 0.940 0.347624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.02652 on 1127 degrees of freedom
## Multiple R-Squared: 0.9941, Adjusted R-squared: 0.9941
## F-statistic: 4.749e+04 on 4 and 1127 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X2:
## ===================================
## X2 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 -2.35441 18.05915 -0.130 0.896295
## X2.l1 1.08217 0.02968 36.464 < 2e-16 ***
## X1.l2 13.28547 18.16558 0.731 0.464715
## X2.l2 -0.08905 0.02951 -3.017 0.002606 **
## const 49.49294 14.04701 3.523 0.000443 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 16.17 on 1127 degrees of freedom
## Multiple R-Squared: 0.9985, Adjusted R-squared: 0.9985
## F-statistic: 1.847e+05 on 4 and 1127 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## X1 X2
## X1 0.0007035 0.0128
## X2 0.0127988 261.5024
##
## Correlation matrix of residuals:
## X1 X2
## X1 1.00000 0.02984
## X2 0.02984 1.00000
# Repeat with interest rates
dia.varb=cbind(dia_data$interest, dia_data$diamond)
dia.b_tot= data.frame(dia.varb)
VARselect(dia.b_tot, lag.max = 10)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 4 2 2 4
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) -1.0748074 -1.0958531 -1.0939278 -1.1034715 -1.0972617 -1.0948353
## HQ(n) -1.0646711 -1.0789594 -1.0702767 -1.0730628 -1.0600956 -1.0509117
## SC(n) -1.0479854 -1.0511498 -1.0313432 -1.0230056 -0.9989145 -0.9786068
## FPE(n) 0.3413635 0.3342544 0.3348986 0.3317178 0.3337842 0.3345954
## 7 8 9 10
## AIC(n) -1.0958407 -1.0908877 -1.0853723 -1.0823254
## HQ(n) -1.0451596 -1.0334491 -1.0211763 -1.0113719
## SC(n) -0.9617309 -0.9388965 -0.9154999 -0.8945716
## FPE(n) 0.3342595 0.3359197 0.3377782 0.3388097
dia.varb_model=VAR(dia.b_tot,p=4)
summary(dia.varb_model)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: X1, X2
## Deterministic variables: const
## Sample size: 1130
## Log Likelihood: -2559.733
## Roots of the characteristic polynomial:
## 0.9992 0.9992 0.5321 0.5321 0.4208 0.2389 0.2389 0.2145
## Call:
## VAR(y = dia.b_tot, p = 4)
##
##
## Estimation results for equation X1:
## ===================================
## X1 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + X1.l3 + X2.l3 + X1.l4 + X2.l4 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 1.126e+00 2.968e-02 37.949 < 2e-16 ***
## X2.l1 3.665e-05 6.347e-05 0.577 0.563769
## X1.l2 -1.615e-01 4.492e-02 -3.595 0.000339 ***
## X2.l2 -4.611e-05 9.458e-05 -0.488 0.625971
## X1.l3 -9.464e-02 4.500e-02 -2.103 0.035679 *
## X2.l3 5.640e-05 9.450e-05 0.597 0.550739
## X1.l4 1.253e-01 2.970e-02 4.218 2.66e-05 ***
## X2.l4 -3.868e-05 6.367e-05 -0.608 0.543602
## const -8.407e-02 3.765e-02 -2.233 0.025739 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.03477 on 1121 degrees of freedom
## Multiple R-Squared: 0.9977, Adjusted R-squared: 0.9977
## F-statistic: 6.051e+04 on 8 and 1121 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X2:
## ===================================
## X2 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + X1.l3 + X2.l3 + X1.l4 + X2.l4 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 -9.657567 13.968599 -0.691 0.4895
## X2.l1 1.107864 0.029867 37.093 <2e-16 ***
## X1.l2 33.374737 21.140432 1.579 0.1147
## X2.l2 -0.109735 0.044512 -2.465 0.0138 *
## X1.l3 -39.066700 21.177029 -1.845 0.0653 .
## X2.l3 -0.003202 0.044472 -0.072 0.9426
## X1.l4 12.922117 13.977167 0.925 0.3554
## X2.l4 0.007627 0.029962 0.255 0.7991
## const -25.269890 17.718203 -1.426 0.1541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 16.36 on 1121 degrees of freedom
## Multiple R-Squared: 0.9984, Adjusted R-squared: 0.9984
## F-statistic: 9.018e+04 on 8 and 1121 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## X1 X2
## X1 0.001209 0.01859
## X2 0.018590 267.71756
##
## Correlation matrix of residuals:
## X1 X2
## X1 1.00000 0.03268
## X2 0.03268 1.00000
# Repeat with both inflation and interest
dia.varc=cbind(dia_data$inflation, dia_data$interest, dia_data$diamond)
dia.c_tot= data.frame(dia.varc)
VARselect(dia.c_tot, lag.max = 5)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 4 2 1 4
##
## $criteria
## 1 2 3 4 5
## AIC(n) -8.439195888 -8.4614334774 -8.4569495044 -8.4824910477 -8.4782967436
## HQ(n) -8.418999816 -8.4260903507 -8.4064593234 -8.4168538124 -8.3975124540
## SC(n) -8.385742345 -8.3678897761 -8.3233156453 -8.3087670309 -8.2644825690
## FPE(n) 0.000216224 0.0002114688 0.0002124194 0.0002070629 0.0002079337
dia.varc_model=VAR(dia.c_tot,p=3)
summary(dia.varc_model)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: X1, X2, X3
## Deterministic variables: const
## Sample size: 1131
## Log Likelihood: 0.024
## Roots of the characteristic polynomial:
## 0.9984 0.9984 0.9788 0.3157 0.2193 0.2193 0.215 0.1674 0.1674
## Call:
## VAR(y = dia.c_tot, p = 3)
##
##
## Estimation results for equation X1:
## ===================================
## X1 = X1.l1 + X2.l1 + X3.l1 + X1.l2 + X2.l2 + X3.l2 + X1.l3 + X2.l3 + X3.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 1.082e+00 3.113e-02 34.757 <2e-16 ***
## X2.l1 -3.930e-02 2.359e-02 -1.666 0.0960 .
## X3.l1 7.946e-05 4.900e-05 1.622 0.1052
## X1.l2 -3.205e-02 4.572e-02 -0.701 0.4834
## X2.l2 6.737e-02 3.514e-02 1.917 0.0555 .
## X3.l2 -5.967e-05 7.189e-05 -0.830 0.4067
## X1.l3 -5.577e-02 3.128e-02 -1.783 0.0748 .
## X2.l3 -3.075e-02 2.356e-02 -1.305 0.1922
## X3.l3 -1.564e-05 4.871e-05 -0.321 0.7483
## const -3.101e-02 4.734e-02 -0.655 0.5126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.02651 on 1121 degrees of freedom
## Multiple R-Squared: 0.9941, Adjusted R-squared: 0.9941
## F-statistic: 2.111e+04 on 9 and 1121 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X2:
## ===================================
## X2 = X1.l1 + X2.l1 + X3.l1 + X1.l2 + X2.l2 + X3.l2 + X1.l3 + X2.l3 + X3.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 -3.264e-02 4.116e-02 -0.793 0.427864
## X2.l1 1.126e+00 3.119e-02 36.089 < 2e-16 ***
## X3.l1 3.123e-05 6.478e-05 0.482 0.629878
## X1.l2 5.703e-02 6.044e-02 0.944 0.345613
## X2.l2 -1.741e-01 4.646e-02 -3.748 0.000188 ***
## X3.l2 -2.657e-05 9.504e-05 -0.280 0.779890
## X1.l3 -2.781e-02 4.135e-02 -0.672 0.501415
## X2.l3 4.264e-02 3.115e-02 1.369 0.171362
## X3.l3 7.039e-06 6.440e-05 0.109 0.912983
## const -1.122e-01 6.259e-02 -1.792 0.073327 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.03504 on 1121 degrees of freedom
## Multiple R-Squared: 0.9977, Adjusted R-squared: 0.9976
## F-statistic: 5.3e+04 on 9 and 1121 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X3:
## ===================================
## X3 = X1.l1 + X2.l1 + X3.l1 + X1.l2 + X2.l2 + X3.l2 + X1.l3 + X2.l3 + X3.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 -3.89091 18.96145 -0.205 0.837452
## X2.l1 -7.65291 14.36964 -0.533 0.594433
## X3.l1 1.08221 0.02984 36.262 < 2e-16 ***
## X1.l2 46.72944 27.84679 1.678 0.093608 .
## X2.l2 39.95935 21.40339 1.867 0.062168 .
## X3.l2 -0.11047 0.04379 -2.523 0.011773 *
## X1.l3 -28.48686 19.05056 -1.495 0.135110
## X2.l3 -29.65620 14.35242 -2.066 0.039031 *
## X3.l3 0.01604 0.02967 0.541 0.588950
## const 96.80988 28.83409 3.357 0.000813 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 16.14 on 1121 degrees of freedom
## Multiple R-Squared: 0.9985, Adjusted R-squared: 0.9985
## F-statistic: 8.235e+04 on 9 and 1121 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## X1 X2 X3
## X1 0.0007026 -0.0002634 0.01196
## X2 -0.0002634 0.0012279 0.02166
## X3 0.0119608 0.0216585 260.62336
##
## Correlation matrix of residuals:
## X1 X2 X3
## X1 1.00000 -0.28355 0.02795
## X2 -0.28355 1.00000 0.03829
## X3 0.02795 0.03829 1.00000
The results of fitting our diamond model give us much to think about; while the models for interest and inflation are great for their simplicity and high R-squared values, they have very negative log-likelihood functions which means they are not fitting the data very well. Alternatively, the model with all three variables has a very high log-likelihood, but I suspect it is more prone to overfitting the data given the high number of lags the AIC suggests its optimal. Therefore, to try correct this, the VARselect for the final model was limited to 5 lags of which it found the lags up to the 4th to be best for the model. Nevertheless, while this is likely the strongest model, it limits me in how rigorously I can test it given that many functions are incompatible with the number of variables in the model. So moving forward I will be using the VAR model of diamond price and inflation
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 3 2 5
##
## $criteria
## 1 2 3 4 5
## AIC(n) -1.5946798 -1.6133101 -1.6240837 -1.6206065 -1.6309348
## HQ(n) -1.5845818 -1.5964800 -1.6005216 -1.5903124 -1.5939087
## SC(n) -1.5679531 -1.5687655 -1.5617212 -1.5404262 -1.5329366
## FPE(n) 0.2029735 0.1992271 0.1970923 0.1977788 0.1957467
##
## VAR Estimation Results:
## =========================
## Endogenous variables: X1, X2
## Deterministic variables: const
## Sample size: 1129
## Log Likelihood: -2261.301
## Roots of the characteristic polynomial:
## 0.998 0.998 0.5774 0.5774 0.5122 0.5122 0.5115 0.5115 0.4624 0.4624
## Call:
## VAR(y = g.a_tot, p = 5)
##
##
## Estimation results for equation X1:
## ===================================
## X1 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + X1.l3 + X2.l3 + X1.l4 + X2.l4 + X1.l5 + X2.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 1.093e+00 2.986e-02 36.610 < 2e-16 ***
## X2.l1 1.328e-04 4.740e-05 2.801 0.00518 **
## X1.l2 -5.121e-02 4.417e-02 -1.159 0.24662
## X2.l2 -1.239e-04 6.376e-05 -1.944 0.05219 .
## X1.l3 -6.278e-02 4.415e-02 -1.422 0.15533
## X2.l3 -1.333e-05 6.372e-05 -0.209 0.83435
## X1.l4 5.773e-02 4.413e-02 1.308 0.19110
## X2.l4 -1.266e-04 6.392e-05 -1.981 0.04782 *
## X1.l5 -3.928e-02 2.988e-02 -1.314 0.18896
## X2.l5 1.350e-04 4.754e-05 2.841 0.00458 **
## const -1.802e-03 7.193e-03 -0.251 0.80221
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.0264 on 1118 degrees of freedom
## Multiple R-Squared: 0.9942, Adjusted R-squared: 0.9941
## F-statistic: 1.911e+04 on 10 and 1118 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X2:
## ===================================
## X2 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + X1.l3 + X2.l3 + X1.l4 + X2.l4 + X1.l5 + X2.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 16.33144 18.80993 0.868 0.38545
## X2.l1 0.89494 0.02986 29.975 < 2e-16 ***
## X1.l2 6.38567 27.82339 0.230 0.81852
## X2.l2 -0.02903 0.04016 -0.723 0.46992
## X1.l3 7.42366 27.81010 0.267 0.78956
## X2.l3 0.11416 0.04013 2.844 0.00453 **
## X1.l4 20.99177 27.79779 0.755 0.45031
## X2.l4 -0.04167 0.04026 -1.035 0.30098
## X1.l5 -53.28733 18.82301 -2.831 0.00472 **
## X2.l5 0.05916 0.02994 1.976 0.04842 *
## const 8.29898 4.53089 1.832 0.06727 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 16.63 on 1118 degrees of freedom
## Multiple R-Squared: 0.9957, Adjusted R-squared: 0.9957
## F-statistic: 2.587e+04 on 10 and 1118 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## X1 X2
## X1 0.0006972 0.02902
## X2 0.0290170 276.59246
##
## Correlation matrix of residuals:
## X1 X2
## X1 1.00000 0.06608
## X2 0.06608 1.00000
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 5 4 5
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) -1.1149406 -1.1341504 -1.1509851 -1.1826051 -1.1914706 -1.1856966
## HQ(n) -1.1048044 -1.1172567 -1.1273339 -1.1521965 -1.1543045 -1.1417730
## SC(n) -1.0881186 -1.0894471 -1.0884005 -1.1021392 -1.0931234 -1.0694681
## FPE(n) 0.3279348 0.3216953 0.3163251 0.3064795 0.3037746 0.3055339
## 7 8 9 10
## AIC(n) -1.1846935 -1.1833478 -1.180764 -1.1778015
## HQ(n) -1.1340124 -1.1259092 -1.116568 -1.1068481
## SC(n) -1.0505837 -1.0313566 -1.010892 -0.9900478
## FPE(n) 0.3058409 0.3062532 0.307046 0.3079577
##
## VAR Estimation Results:
## =========================
## Endogenous variables: X1, X2
## Deterministic variables: const
## Sample size: 1129
## Log Likelihood: -2504.91
## Roots of the characteristic polynomial:
## 0.9992 0.9714 0.6406 0.6406 0.533 0.533 0.4407 0.4116 0.4116 0.2324
## Call:
## VAR(y = g.b_tot, p = 5)
##
##
## Estimation results for equation X1:
## ===================================
## X1 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + X1.l3 + X2.l3 + X1.l4 + X2.l4 + X1.l5 + X2.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 1.120e+00 3.112e-02 35.997 < 2e-16 ***
## X2.l1 -7.114e-05 6.624e-05 -1.074 0.283115
## X1.l2 -1.581e-01 4.599e-02 -3.438 0.000608 ***
## X2.l2 4.706e-05 8.579e-05 0.549 0.583422
## X1.l3 -1.010e-01 4.619e-02 -2.188 0.028892 *
## X2.l3 -5.900e-05 8.562e-05 -0.689 0.490885
## X1.l4 1.171e-01 4.604e-02 2.544 0.011088 *
## X2.l4 -4.143e-06 8.581e-05 -0.048 0.961497
## X1.l5 8.456e-03 3.172e-02 0.267 0.789835
## X2.l5 5.115e-05 6.510e-05 0.786 0.432210
## const 5.430e-02 3.535e-02 1.536 0.124856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.03483 on 1118 degrees of freedom
## Multiple R-Squared: 0.9977, Adjusted R-squared: 0.9977
## F-statistic: 4.818e+04 on 10 and 1118 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X2:
## ===================================
## X2 = X1.l1 + X2.l1 + X1.l2 + X2.l2 + X1.l3 + X2.l3 + X1.l4 + X2.l4 + X1.l5 + X2.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 -56.30148 14.51594 -3.879 0.000111 ***
## X2.l1 0.82667 0.03090 26.751 < 2e-16 ***
## X1.l2 -0.72431 21.45407 -0.034 0.973074
## X2.l2 -0.02668 0.04002 -0.667 0.505104
## X1.l3 -10.83532 21.54647 -0.503 0.615147
## X2.l3 0.11599 0.03994 2.904 0.003756 **
## X1.l4 5.49808 21.48004 0.256 0.798027
## X2.l4 -0.02611 0.04003 -0.652 0.514295
## X1.l5 51.29978 14.79774 3.467 0.000547 ***
## X2.l5 0.07805 0.03037 2.570 0.010296 *
## const 50.02039 16.49304 3.033 0.002479 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 16.25 on 1118 degrees of freedom
## Multiple R-Squared: 0.9959, Adjusted R-squared: 0.9959
## F-statistic: 2.71e+04 on 10 and 1118 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## X1 X2
## X1 0.001213 -0.1574
## X2 -0.157412 264.0494
##
## Correlation matrix of residuals:
## X1 X2
## X1 1.0000 -0.2781
## X2 -0.2781 1.0000
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 5 1 5
##
## $criteria
## 1 2 3 4 5
## AIC(n) -8.4512483830 -8.4757598806 -8.4888493738 -8.5381896828 -8.5593505073
## HQ(n) -8.4310523106 -8.4404167539 -8.4383591928 -8.4725524475 -8.4785662177
## SC(n) -8.3977948394 -8.3822161792 -8.3552155147 -8.3644656659 -8.3455363328
## FPE(n) 0.0002136336 0.0002084608 0.0002057502 0.0001958451 0.0001917449
##
## VAR Estimation Results:
## =========================
## Endogenous variables: X1, X2, X3
## Deterministic variables: const
## Sample size: 1129
## Log Likelihood: 73.809
## Roots of the characteristic polynomial:
## 0.998 0.998 0.9713 0.6991 0.6991 0.604 0.604 0.5614 0.5614 0.5351 0.5351 0.498 0.498 0.4253 0.1787
## Call:
## VAR(y = g.c_tot, p = 5)
##
##
## Estimation results for equation X1:
## ===================================
## X1 = X1.l1 + X2.l1 + X3.l1 + X1.l2 + X2.l2 + X3.l2 + X1.l3 + X2.l3 + X3.l3 + X1.l4 + X2.l4 + X3.l4 + X1.l5 + X2.l5 + X3.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 1.088e+00 3.102e-02 35.060 < 2e-16 ***
## X2.l1 -2.553e-02 2.468e-02 -1.034 0.301255
## X3.l1 1.228e-04 5.005e-05 2.453 0.014309 *
## X1.l2 -3.239e-02 4.558e-02 -0.711 0.477476
## X2.l2 5.099e-02 3.622e-02 1.408 0.159513
## X3.l2 -1.077e-04 6.489e-05 -1.660 0.097149 .
## X1.l3 -5.358e-02 4.549e-02 -1.178 0.239091
## X2.l3 3.512e-02 3.603e-02 0.975 0.329878
## X3.l3 -1.101e-07 6.480e-05 -0.002 0.998645
## X1.l4 9.889e-03 4.590e-02 0.215 0.829459
## X2.l4 -1.236e-01 3.603e-02 -3.431 0.000624 ***
## X3.l4 -1.710e-04 6.493e-05 -2.634 0.008557 **
## X1.l5 -1.365e-02 3.135e-02 -0.436 0.663257
## X2.l5 6.309e-02 2.506e-02 2.517 0.011968 *
## X3.l5 1.601e-04 4.932e-05 3.247 0.001202 **
## const -2.092e-03 2.816e-02 -0.074 0.940800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.0263 on 1113 degrees of freedom
## Multiple R-Squared: 0.9943, Adjusted R-squared: 0.9942
## F-statistic: 1.285e+04 on 15 and 1113 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X2:
## ===================================
## X2 = X1.l1 + X2.l1 + X3.l1 + X1.l2 + X2.l2 + X3.l2 + X1.l3 + X2.l3 + X3.l3 + X1.l4 + X2.l4 + X3.l4 + X1.l5 + X2.l5 + X3.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 -2.020e-02 4.057e-02 -0.498 0.61856
## X2.l1 1.106e+00 3.228e-02 34.249 < 2e-16 ***
## X3.l1 -6.663e-05 6.546e-05 -1.018 0.30896
## X1.l2 6.511e-02 5.961e-02 1.092 0.27497
## X2.l2 -1.361e-01 4.737e-02 -2.874 0.00413 **
## X3.l2 4.937e-05 8.486e-05 0.582 0.56086
## X1.l3 -2.343e-01 5.949e-02 -3.939 8.7e-05 ***
## X2.l3 -1.510e-01 4.712e-02 -3.206 0.00139 **
## X3.l3 -6.603e-05 8.474e-05 -0.779 0.43602
## X1.l4 9.513e-02 6.003e-02 1.585 0.11329
## X2.l4 1.345e-01 4.712e-02 2.855 0.00439 **
## X3.l4 2.063e-05 8.491e-05 0.243 0.80808
## X1.l5 1.004e-01 4.100e-02 2.449 0.01447 *
## X2.l5 3.500e-02 3.278e-02 1.068 0.28580
## X3.l5 3.383e-05 6.450e-05 0.525 0.60003
## const 3.201e-02 3.683e-02 0.869 0.38495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.03439 on 1113 degrees of freedom
## Multiple R-Squared: 0.9978, Adjusted R-squared: 0.9977
## F-statistic: 3.295e+04 on 15 and 1113 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation X3:
## ===================================
## X3 = X1.l1 + X2.l1 + X3.l1 + X1.l2 + X2.l2 + X3.l2 + X1.l3 + X2.l3 + X3.l3 + X1.l4 + X2.l4 + X3.l4 + X1.l5 + X2.l5 + X3.l5 + const
##
## Estimate Std. Error t value Pr(>|t|)
## X1.l1 5.65233 19.16690 0.295 0.768125
## X2.l1 -50.88036 15.25205 -3.336 0.000878 ***
## X3.l1 0.82627 0.03093 26.715 < 2e-16 ***
## X1.l2 -2.42041 28.16656 -0.086 0.931536
## X2.l2 -5.66135 22.38348 -0.253 0.800373
## X3.l2 -0.02876 0.04010 -0.717 0.473364
## X1.l3 6.60034 28.10744 0.235 0.814388
## X2.l3 -8.19027 22.26196 -0.368 0.713015
## X3.l3 0.11653 0.04004 2.911 0.003679 **
## X1.l4 20.68432 28.36207 0.729 0.465974
## X2.l4 9.84089 22.26230 0.442 0.658544
## X3.l4 -0.02714 0.04012 -0.677 0.498817
## X1.l5 -32.52524 19.37023 -1.679 0.093406 .
## X2.l5 43.30515 15.48592 2.796 0.005256 **
## X3.l5 0.07865 0.03048 2.581 0.009982 **
## const 57.29314 17.40143 3.292 0.001024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 16.25 on 1113 degrees of freedom
## Multiple R-Squared: 0.9959, Adjusted R-squared: 0.9959
## F-statistic: 1.807e+04 on 15 and 1113 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## X1 X2 X3
## X1 0.0006916 -0.0002553 0.03123
## X2 -0.0002553 0.0011828 -0.15333
## X3 0.0312339 -0.1533314 264.05450
##
## Correlation matrix of residuals:
## X1 X2 X3
## X1 1.00000 -0.2823 0.07309
## X2 -0.28227 1.0000 -0.27436
## X3 0.07309 -0.2744 1.00000
We see a similar pattern emerge among the models for gold price with the most robust model being the one with all variables. Nevertheless, the issue remains the same so I must choose either interest or inflation as the other variable. Given the log-likelihoods of each model, it seems that the inflation rate model has a higher value and thus we will be using this model moving forward.
ccf(dia_data$inflation, dia_data$diamond, ylab="Cross-Correlation Function", main = "CCF")
ccf(dia_data$inflation, dia_data$gold,ylab="Cross-Correlation Function", main = "CCF")
The Cross-Correlation Function, much like the name suggests, is a measure of the correlation across the lags of the different variables. Here, we see that the CCF for inflation rate and diamond price show a positive correlation between the lags and the variables being considered. The lags around -20 are the most positively correlated and all the lags are statistically significant. Contrastingly, the CCF for inflation rate and gold price show a negative correlation between the lags and the variables being considered. The most negatively correlated lags are centered around lag -20 and here too all lags are statistically significant. Therefore, both of these plots show that the variables do affect each other and have a relationship across many lags.
grangertest(dia_data$`diamond price` ~ dia_data$`inflation rate`, order = 4)
grangertest(dia_data$`gold price` ~ dia_data$`inflation rate`, order = 4)
According to the granger causality test, inflation rate affects diamond price and gold price in a statistically significant manner. Using the previous CCF, we can infer that higher inflation rates may increase diamond price while also decreasing gold price. This difference is possibly due to the role of gold as a reserve asset during economic instability.
plot(irf(dia.vara_model, n.ahead=36))
plot(irf(g.vara_model, n.ahead=36))
An Impulse-Response Function (IRF) plot is an impulse response function that demonstrates what happens to the dependent variable when a shock of the explanatory variable occurs. The first IRF plot of inflation rate with diamond price shows that when a shock occurs with inflation rate the diamond price will decrease continually. The second IRF plot of inflation rate with gold prices shows that when a shock occurs with inflation rate the gold price will quickly stabilize. This is to be expected, again, because of gold’s role as a reserve asset.
tsdisplay(residuals(dia.vara_model)[,2])
tsdisplay(residuals(g.vara_model)[,2])
Both residual plots demonstrate that the data is stationary and returns to a mean for both models. Furthermore, the ACF and PACF plots show that there is no significant structure to our models which only reinforces the idea of stationarity in our models. The model of diamond prices show that there is only one significant lag at 23, while the gold model has significant lags at 6, 16, and 23.
set.seed(1)
Row.number <- sample(1:nrow(dia), 0.67*nrow(dia))
Trainn = dia[Row.number,]
Testn = dia[-Row.number,]
dim(Trainn)
## [1] 759 6
dim(Testn)
## [1] 375 6
infl_ttsn = ts(Trainn$`inflation rate`,start=c(1), end=c(1134), frequency=6)
diapr_ttsn = ts(Trainn$`diamond price`,start=c(1), end=c(1134), frequency=6)
gold_ttsn = ts(Trainn$`gold price`,start=c(1), end=c(1134), frequency=6)
du_ttsn = diff(diapr_ttsn)
fu_ttsn = diff(infl_ttsn)
gu_ttsn = diff(gold_ttsn)
Reg.moddiatn = dynlm(du_ttsn ~ L(du_ttsn,1) + L(fu_ttsn,1), data = Trainn)
Reg.modgoldtn = dynlm(gu_ttsn ~ L(gu_ttsn,1) + L(fu_ttsn,1), data = Trainn)
infl_tstn = ts(Testn$`inflation rate`,start=c(1), end=c(1134), frequency=6)
diapr_tstn = ts(Testn$`diamond price`,start=c(1), end=c(1134), frequency=6)
gold_tstn = ts(Testn$`gold price`,start=c(1), end=c(1134), frequency=6)
du_tstn = diff(diapr_tstn)
fu_tstn = diff(infl_ttsn)
gu_tstn = diff(gold_ttsn)
Reg.moddiats = dynlm(du_tstn ~ L(du_tstn,1) + L(fu_tstn,1), data = Testn)
Reg.modgoldts = dynlm(gu_tstn ~ L(gu_tstn,1) + L(fu_tstn,1), data = Testn)
mean((Trainn$`diamond price` - predict(Reg.moddiatn, Trainn)) ^2)
## [1] 100559570
mean((Testn$`diamond price` - predict(Reg.moddiats, Testn)) ^2)
## [1] 100165656
mean((Trainn$`gold price` - predict(Reg.modgoldtn, Trainn)) ^2)
## [1] 2382133
mean((Testn$`gold price` - predict(Reg.modgoldts, Testn)) ^2)
## [1] 2502909
Splitting the data into test and train datasets and comparing model performance across both shows that we had relatively similar values for the test and train MSEs of both gold and diamond. For gold the test MSE was higher than the train MSE while for diamond, the test MSE was lower than the train MSE. This means that the gold model is slightly overfitting while the diamond model seems to be performing well.
var.predict = predict(object=dia.vara_model, n.ahead=52)
# Extracting the forecasts
dia.varforecast <- ((c(var.predict$fcst$X2[1:52,1])))
# Transforming this into a dataset and combining with the dates.
forecasteddiavf <- c(rev(dia.varforecast),dia$`diamond price`)
forecasteddiavf <- as.data.frame(forecasteddiavf)
dates <- vector(length = 52)
dates[1] <- as_date("2021-06-05")
for (i in 2:52) {
dates[i] <- dates[i-1] + 1
}
forecasteddiavf$dates <- c(as_date(c(rev(dates))), dia_data$`date`)
forecasteddiavf$forecast <- forecasteddiavf$dates > "2021-06-05"
# Plotting between April 2018 and June 2021.
frplot.diavf <- ggplot(data = forecasteddiavf, aes(x = dates, y = forecasteddiavf, group = forecast)) +
geom_point() +
geom_smooth(aes(color = forecast), method = "loess") +
xlim(as.Date("2018-04-28"), as.Date("2021-07-26")) +
ggtitle("Diamond Price VAR Forecast (April 2018 - June 2021)") +
xlab("Date") + ylab("Price USD")
frplot.diavf
## `geom_smooth()` using formula = 'y ~ x'
var.predict.2 = predict(object=g.vara_model, n.ahead=52)
# Extracting the forecasts
g.varforecast <- ((c(var.predict.2$fcst$X2[1:52,1])))
# Transforming this into a dataset and combining with the dates.
forecastedgoldvf <- c(rev(g.varforecast),dia$`gold price`)
forecastedgoldvf <- as.data.frame(forecastedgoldvf)
forecastedgoldvf$dates <- c(as_date(c(rev(dates))), dia_data$`date`)
forecastedgoldvf$forecast <- forecastedgoldvf$dates > "2021-06-05"
# Plotting between April 2018 and June 2021.
frplot.goldvf <- ggplot(data = forecastedgoldvf, aes(x = dates, y = forecastedgoldvf, group = forecast)) +
geom_point() +
geom_smooth(aes(color = forecast), method = "loess") +
xlim(as.Date("2018-04-28"), as.Date("2021-07-26")) +
ggtitle("Gold Price VAR Forecast (April 2018 - June 2021)") +
xlab("Date") + ylab("Price USD")
frplot.goldvf
## `geom_smooth()` using formula = 'y ~ x'
Forecasting with VAR allows for more long-term predictions therefore, unlike with the AR and ARDL models where we only forecasted 10 days into the future, here we predicted 52 days into the future. That is why the scale of the dates has been made longer to better demonstrate how these forecasts fit the whole span of the data.
In the case of the diamond price forecast, the forecast seemingly follows the trend of the diamond price to increase and fits the data nicely. Contrastingly, gold price does also fit the data but predicts a decrease instead of increase and a much steeper one than with the ARDL forecast. Therefore, the VAR forecasts seem to work well and allow for further forecasting than the other models.
plot(fevd(dia.vara_model, n.ahead = 5))
plot(fevd(g.vara_model, n.ahead = 5))
The FEVD plot shows us the Forecast Error Variance Decomposition of both our models and how significant the shocks to either model will be to the forecast of errors. Based on these two plots the shocks to inflation rate will have very little effect on either of the models with some residual effect visible on the gold model.
The dataset I chose for this comparative time-series analysis revolved around various economic measures from 2018 to 2021. These variables are date, diamond price, gold price, inflation rate, interest rate, and the federal funds rate. The first stage of this analysis sought to understand the data through basic plots such as graphs, histograms, and summary tables. This revealed the general shape of each variable across time and gave me a basis for comparison for the upcoming forecasts. Following this, some pre-processing was necessary in converting the data into time-series and correcting for stationarity based on the residual and PACF plots of the non-corrected data suggesting that the data was not stationary. This allowed the data to be analyzed by both Autoregressive (AR) and Autoregressive Distributed Lag (ARDL) models.
In terms of the modeling, I ran an utoregressive model on every variable to see how well the current value of the variable is predicted by its previous lags. This process involved fitting the model, evaluating its performance, and then producing a forecast 10-steps into the future. With these models we found that the AR model was more powerful for the rates (ie interest and inflation) than with the prices. This makes sense given the both of these rate have economic relationships with price thus the AR model cannot capture this. Therefore, I expected that modeling with an ARDL model, which includes other variables, would be more powerful and modeled both gold and diamond prices using this method. Firstly, I determined that they would be best modeled with inflation instead of interest as the inflation rate seemed to perform best. The forecasts did seem to follow a similar trend to that of the AR models but were more robust and performed better despite some overfitting with the diamond ARDL model. However, the final model I used was a Vector Autoregression (VAR) which expands on the ARDL model and I expected to be more powerful in modeling these variables. This process differed from the other models as there are other effects that need to be tested for and analyzed to determine the suitability of the model. Nevertheless, the models performed well in these tests and fit the model well allowing for a more powerful forecast 52-steps into the future which fit the data and performed very well. Therefore, the results of this comparative time-series analysis indicated that both diamond and gold prices are best modeled using a Vector Autoregression with inflation rates.